ABSTRACT

In the course of music recording sessions, the same vocal or
instrumental passages are usually performed several times.
However, only the best takes are chosen and processed further.
Especially for lead vocals and solo instruments, the quantity
of recorded material can be overwhelming, which makes the
selection process time-consuming. Our goal is to automate and
objectify this procedure in order to help music producers
decide faster. The task of automatic best take detection is
constrained to monophonic lines of electric guitar and singing
voice in popular music. Assuming realistic scenarios during
recording sessions, the proposed system requires only a
synchronized click track and a backing track with accompanying
instruments to be available for analysis.


1. INTRODUCTION 

In the context of music studio recordings, our study deals with
the question of how the selection of the best take could be
assisted by means of automatic Music Information Retrieval
(MIR) techniques. Recent publications cover several relevant
aspects of music performance analysis such as intonation and
timing. However, the process of best take detection itself is,
to the best of the authors’ knowledge, still an unexplored
research field.

Our primary goal is to automatically produce a ranking of
a given set of recorded takes, ordered from best to worst.
The ordering dimension to be estimated is denoted as music
performance quality (MPQ), which according to Williamon and
Valentine is defined as the overall presented and
subject-specific ability. This ability consists of a defined
collection of metrics describing the three aspects of
(1) musical understanding, (2) communicative ability, and
(3) technical proficiency [6]. One of the main problems is that
assessing MPQ is subjective to a certain extent. Therefore,
one challenge is to identify objective criteria.

Another general problem in this area is the differentiation
between musical shortcoming and intention. Similar shapes
of phrasing are assessed differently in their MPQ. Depending
on their underlying scheme or interpretive context, they can
be perceived as either good or bad. For instance, unsystematic
pitch modulation is usually perceived as tonal instability (and
hence bad), while small, cyclic, regular pitch modulation,
known as vibrato, is mostly interpreted as an indication of
sophisticated technical proficiency. The same is true for time
variation: if tempo is changed systematically, this behaviour
is identified as micro-timing rather than rhythmic inaccuracy.
MPQ is highly context-sensitive in general. For example, a
heavy, rough vocal timbre is appropriate in a rock song but
might be considered misplaced in a pop ballad.

Figure 1: Structure and signal flow of the proposed system

2. PROPOSED FRAMEWORK 

As shown in Figure 1, the proposed framework requires three
types of input data: the audio from the musician’s performance
as well as the backing track and a click track providing
harmonic and metrical information. Additionally, the equal
tempered scale (ETS) is used as a reference. Automatic melody
transcription and fundamental frequency (f0) contour estimation
are performed prior to the feature extraction. Based on these
melody representations, a set of timing and intonation features
is computed.
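
As an illustration of how the ETS can act as a reference for an f0 contour, the following minimal sketch measures the deviation of each voiced frame from the nearest equal tempered pitch in cents. The tuning reference of 440 Hz for A4, the convention that 0 marks an unvoiced frame, and both function names are assumptions made for this example; the paper does not specify them.

```python
import math

A4_HZ = 440.0  # assumed tuning reference; not stated in the paper


def cents_to_nearest_ets(f0_hz, a4_hz=A4_HZ):
    """Signed deviation (cents) of f0_hz from the nearest equal tempered pitch."""
    if f0_hz <= 0:
        raise ValueError("f0 must be positive")
    # Continuous pitch in semitones relative to A4.
    semitones = 12.0 * math.log2(f0_hz / a4_hz)
    # Distance to the nearest integer semitone, converted to cents.
    return 100.0 * (semitones - round(semitones))


def pitch_quantization_cost(f0_contour_hz, a4_hz=A4_HZ):
    """Mean absolute deviation (cents) of the voiced frames from the ETS grid."""
    # Frames with f0 <= 0 are treated as unvoiced (assumption for this sketch).
    voiced = [f for f in f0_contour_hz if f > 0]
    if not voiced:
        return 0.0
    return sum(abs(cents_to_nearest_ets(f, a4_hz)) for f in voiced) / len(voiced)
```

A contour that sits exactly on equal tempered pitches yields a cost of zero, while a contour that is consistently a quarter of a semitone sharp yields a cost of about 25 cents.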

The feature set was built up systematically based on a
taxonomy to ensure that all relevant areas are covered by
at least one feature. To this end, we followed the
recommendation from [8] in designing distinct rubrics to form
a rule set for the assessment task. In accordance with the four
domains tonality/pitch (including melody and harmony), rhythm,
intensity, and timbre, it is assumed that MPQ can be derived
from the four musical rubrics of intonation, timing, dynamics,
and sounding.

In a preliminary study, we conducted expert interviews.
The experts’ subject-specific knowledge provided further
insight into the task at hand. Since the professional audio
engineers confirmed our assumption that dynamics and sounding
are highly subjective, only intonation and timing are pursued
further. Both can be analyzed locally (note-wise), regionally
(pattern-wise), and globally (segment-wise), which leads to
the metrics shown in Table 1.

                         Timing                     Intonation

Local (note-wise)        Note onset accuracy        Pitch accuracy
                         Note offset accuracy       Pitch stability
                         Note duration accuracy     Pitch drift amount
                                                    Scooping amount
                                                    Vibrato amount

Regional (pattern-wise)  Relative timing            Relative intonation

Global (segment-wise)    Overall static time shift  Overall static pitch shift

Table 1: Defined metrics for the systematic development of
features
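
To make the distinction between the global and local levels more concrete, the sketch below separates a set of note-wise deviations (for example onset deviations in seconds or pitch deviations in cents) into one global static shift and the remaining local residuals. Estimating the static shift as the median deviation is an assumption made for illustration only; the paper does not state how the shift is computed, and the function names are hypothetical.

```python
from statistics import median


def static_shift(deviations):
    """Global offset shared by all notes, estimated here as the median deviation."""
    if not deviations:
        raise ValueError("no deviations given")
    return median(deviations)


def local_residuals(deviations):
    """Note-wise deviations after removing the global static shift."""
    shift = static_shift(deviations)
    return [d - shift for d in deviations]
```

With this separation, a take that is consistently late by the same amount shows a large global shift but small local residuals, whereas an unsteady take shows the opposite pattern.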

Scooping means that singers slide into notes, starting each
phrase on a low or indeterminate pitch beneath the note and
then correcting it. While a scooped tone is well intonated,
about 60 percent of the time the right pitch is not reached.
Pitch drift is the opposite shape of a similar pitch
modulation: the correct pitch is reached at the beginning, but
the singer then drifts downward in intonation.
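
The two shapes can be told apart by looking at where within a note the pitch is flat. The following sketch, which is not the authors' feature definition, averages the deviation from the target pitch over the first and the last part of a note's f0 contour. The parameter `edge_fraction` and the function names are hypothetical.

```python
import math


def cents(f0_hz, target_hz):
    """Deviation of f0_hz from target_hz in cents (negative means flat)."""
    return 1200.0 * math.log2(f0_hz / target_hz)


def scoop_and_drift(note_f0_hz, target_hz, edge_fraction=0.25):
    """Return (scoop_amount, drift_amount) in cents for one note.

    scoop_amount: how far below the target the note starts on average.
    drift_amount: how far below the target the note ends on average.
    Both values are clipped at zero, so only flat deviations count.
    edge_fraction is a hypothetical parameter, not taken from the paper.
    """
    if not note_f0_hz:
        raise ValueError("empty contour")
    n_edge = max(1, int(len(note_f0_hz) * edge_fraction))
    head = note_f0_hz[:n_edge]
    tail = note_f0_hz[-n_edge:]
    head_dev = sum(cents(f, target_hz) for f in head) / len(head)
    tail_dev = sum(cents(f, target_hz) for f in tail) / len(tail)
    return max(0.0, -head_dev), max(0.0, -tail_dev)
```

A note that starts a semitone flat and then settles on the target produces a scoop amount of about 100 cents and no drift; the mirrored contour produces the reverse.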

Based on the literature of related work, we chose four
different intermediate musical representations. The features
include tempogram distance, time quantization costs to binary
/ ternary grids, pitch class histogram distances, pitch
quantization costs towards the equal tempered scale, a pitch
stability measure, f0 slope, and vibrato likelihood.
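
One possible reading of the time quantization costs is sketched below: note onsets, expressed in beats relative to the click track, are compared against a grid that divides each beat into two (binary) or three (ternary) equal parts. Expressing onsets in beats, using the mean absolute distance as the cost, and the function names are all assumptions for this example.

```python
def grid_cost(onsets_beats, subdivisions):
    """Mean distance (in beats) from each onset to the nearest grid point.

    subdivisions=2 gives a binary grid, subdivisions=3 a ternary grid.
    """
    if not onsets_beats:
        return 0.0
    step = 1.0 / subdivisions
    total = 0.0
    for onset in onsets_beats:
        # Snap the onset to the closest multiple of the grid step.
        nearest = round(onset / step) * step
        total += abs(onset - nearest)
    return total / len(onsets_beats)


def time_quantization_costs(onsets_beats):
    """Costs against both grids; the lower value indicates the better fit."""
    return {
        "binary": grid_cost(onsets_beats, 2),
        "ternary": grid_cost(onsets_beats, 3),
    }
```

Straight eighth notes fit the binary grid with zero cost and incur a positive cost on the ternary grid, while triplet onsets behave the other way round.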

3. EVALUATION 

The feature extraction step described in the previous section
leads to 88-dimensional feature vectors. In a first step, an
ordinal classification is applied to determine the rank. For
this purpose, several machine learning approaches were tested,
including Support Vector Machines (SVM) with different kernel
functions, Gaussian Mixture Models (GMM), and different types
of regression. For our data set, partial least squares
regression (PLSR) performed best w.r.t. the Kendall rank
correlation (Kendall’s τ). Using a novel dataset of 300 short
monophonic guitar and vocal audio snippets with a total
duration of 78 minutes, we achieve best performance values of
τ = 0.68 (guitar) and τ = 0.53 (vocals).
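
For readers unfamiliar with the Kendall rank correlation, the sketch below computes it for a reference rating and a predicted rating of the same takes by counting concordant and discordant pairs. It implements the simple tau-a variant, in which tied pairs count as neither; which variant was used in the evaluation is not stated, so this choice is an assumption.

```python
from itertools import combinations


def kendall_tau(reference, predicted):
    """Kendall's tau-a between two equally long score lists."""
    if len(reference) != len(predicted):
        raise ValueError("inputs must have equal length")
    n = len(reference)
    if n < 2:
        raise ValueError("need at least two takes")
    concordant = discordant = 0
    for i, j in combinations(range(n), 2):
        # A pair is concordant if both ratings order the two takes the same way.
        product = (reference[i] - reference[j]) * (predicted[i] - predicted[j])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)
```

A value of 1 means the predicted ranking matches the reference exactly, -1 means it is fully reversed, and swapping two neighbouring takes out of four lowers the value to 2/3.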

4. CONCLUSION 

The proposed system achieves good performance for the newly
defined task of best take detection. In general, the difficulty
of differentiating between intention and deficiency remains the
main challenge of the proposed task. Additionally, the amount
of training data could be further increased to better represent
different performance levels and music styles. The results for
guitar are better than for vocals. One reason for this behavior
is the higher error rate of the automatic transcription for
vocals. While guitar transcriptions reach an F-measure of 0.91,
the F-measure for vocal transcription is merely 0.70.
Transcription errors propagate to the feature extraction stage.
Furthermore, we observed that even trained human raters do
not agree in all cases.